3DS : Cherry Blossom 10-Mile Race

Lyuda Bekwinknoll, Meghana Cyanam, Theresa Marie Duenas, Kevin Kiser

With our data visualization, we examine the association between age and fitness using running data from the Cherry Blossom Ten-Mile Race, held in Washington, D.C., from 1973 to 2019.

Background:

The Credit Union Cherry Blossom (CUCB) is a non-profit organization that runs an annual 10-mile race in Washington, D.C. The race takes place on the first Sunday of each April, and contestants are admitted through a lottery entry system. The lottery started in 1980; 1977 was the first year in which the number of runners had to be limited. Using the data collected from this race, we explore the relationship between aging and fitness.

Data Extraction Method:

We created a Jupyter notebook to scrape the data from the webpage: <https://www.cballtimeresults.org/performance-search/?eventType=10M&year=1973&division=M&page=1>. The Python library “requests” was used to connect to the URL, and “BeautifulSoup” was used to parse the HTML and extract the data. We wrote a function that iterated through the specified variables (section, division, year, and page) to ensure all the data was collected. The CUCB website contains data for the following columns: Name, PiD/TiD, PiS/TiS, Age, Time, Pace, Division, and Home Town. The data was scraped for men and women between 1973 and 2019. 1973 was the first year the event was held, and 2019 was the year before organizers canceled the event for the first time due to COVID-19.
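As a rough illustration of the iteration described above, the sketch below only builds the request URLs from the query parameters visible in the example URL (eventType, year, division, page). It does not fetch or parse anything; the “requests” and “BeautifulSoup” steps are left out. The function names are hypothetical, the use of “W” as the women's division code is an assumption (only “M” appears in the URL above), and only page 1 is generated because the number of pages per year is not stated here.

```python
from itertools import product
from urllib.parse import urlencode

BASE_URL = "https://www.cballtimeresults.org/performance-search/"


def build_url(year, division, page, event_type="10M"):
    """Return the results-page URL for one year, division, and page."""
    query = urlencode({
        "eventType": event_type,
        "year": year,
        "division": division,
        "page": page,
    })
    return f"{BASE_URL}?{query}"


def first_page_urls(years, divisions):
    """Yield the page-1 URL for every (year, division) combination."""
    for year, division in product(years, divisions):
        yield build_url(year, division, page=1)


# "W" is an assumed division code; only "M" is shown in the report's URL.
urls = list(first_page_urls(range(1973, 2020), ["M", "W"]))
print(len(urls))  # 47 years x 2 divisions
print(urls[0])
```

In a full scraper, each URL would be requested in turn and the page counter increased until a page returns no result rows.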

We incorporated weather data from the National Oceanic and Atmospheric Administration (NOAA) and the National Centers for Environmental Information. The closest weather station to the race was Ronald Reagan Washington National Airport, Arlington, VA, situated on the banks of the Potomac across from the National Mall, where most of the race takes place. From that data set, we added precipitation and minimum and maximum temperatures for each day the race took place. Each event date was manually recorded from the Rite of Spring pdf, which details the history of the race, and was used to join only the weather data relevant to each event date.
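The date-based join described above can be sketched as a simple left join keyed on the race date. This is only an illustration with plain dictionaries, not the code used for the project; the function name is hypothetical, and the weather values and runner names below are placeholders rather than real NOAA readings or race results.

```python
def join_weather(results, weather_by_date):
    """Left-join race rows to daily weather using the 'Date' key."""
    joined = []
    for row in results:
        # Rows whose date has no weather record get None for each field.
        weather = weather_by_date.get(row["Date"], {})
        joined.append({
            **row,
            "PRCP": weather.get("PRCP"),
            "TMIN": weather.get("TMIN"),
            "TMAX": weather.get("TMAX"),
        })
    return joined


# Placeholder values for illustration only (not real NOAA data).
weather_by_date = {
    "1973-04-01": {"PRCP": 0.0, "TMIN": 40, "TMAX": 60},
    "1973-04-02": {"PRCP": 0.2, "TMIN": 42, "TMAX": 58},  # not a race day
}
results = [
    {"Year": 1973, "Date": "1973-04-01", "Name": "Runner A"},
    {"Year": 1973, "Date": "1973-04-01", "Name": "Runner B"},
]

joined = join_weather(results, weather_by_date)
print(joined[0])
```

Because the lookup is driven by the race rows, weather records for non-race days are never attached to any runner.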

Describing our Data

Dataset Overview:

The original data set has 347402 rows and 17 columns. After cleaning, 339934 rows and 17 columns remained. We omitted 7468 rows because they had missing values for the time and/or age variables.
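A minimal sketch of this cleaning rule, assuming each result is a dictionary with “Age” and “Time” keys and that a missing value appears as None or an empty string (both assumptions; the actual cleaning code is not shown here). The three sample rows are invented for illustration.

```python
def drop_incomplete(rows, required=("Age", "Time")):
    """Keep rows that have a value for every required field.

    Returns the kept rows and the number of rows dropped.
    """
    kept = [
        row for row in rows
        if all(row.get(key) not in (None, "") for key in required)
    ]
    return kept, len(rows) - len(kept)


# Invented rows for illustration only.
rows = [
    {"Name": "Runner A", "Age": 31, "Time": "1:12:05"},
    {"Name": "Runner B", "Age": None, "Time": "1:30:10"},
    {"Name": "Runner C", "Age": 45, "Time": ""},
]
kept, dropped = drop_incomplete(rows)
print(len(kept), dropped)

# Consistency check of the row counts reported above.
print(347402 - 7468)  # 339934
```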

| Variable Name | Data Type | Variable Description | How does this variable contribute to project objectives? |
| --- | --- | --- | --- |
| Year | Integer | Year the race was held. | The data span several years, which lets us see how the times of people who ran the race in multiple years changed. Covering many years also gives us access to more data. |
| Date | Character | Date on which the race was held, in year-month-day format. Example: 1973-04-01 | Excluded. See below for the reason. |
| Distance | Character | An alphanumeric code indicating the distance of the race: “10M” (10 miles). | Excluded. See below for the reason. |
| Name | Character | An individual’s first and last name, in varying formats. Most names in the CUCB website results also list an ‘M’, ‘F’, or ‘W’ in parentheses for the individual’s sex. Example: James Yenckel (M) | Used as a personal identifier for each runner. We plan to use this variable in the future to find runners who ran several times and see how their times changed. This will be done in our final project, not necessarily in this data visualization. |
| Age | Integer | Age of the runner at the time of the race. | One of our main variables. It gives us runner ages so we can see how performance differs across age ranges and whether younger people have better times. |
| Time | Time/Numeric | Time taken by each runner to complete the 10 miles, in hr:min:sec format. | One of our main variables. It gives us the runner times so we can see how the times spread as age changes. |
| Division | Character | There are 28 divisions, 14 for each sex. Four of them span 20-year age ranges, while the rest span 5-year ranges. Each division is an alphanumeric code separating competitors by sex and age. Example: W2529 (25- to 29-year-old women) | Used to break the data into age groups, which we then use to draw conclusions about our question. |
| pos_by_division | Integer | The position in which a runner finished within their assigned division for a given year. | Excluded. See below for the reason. |
| total_by_division | Integer | The total number of individuals in each division for a given year. | Excluded. See below for the reason. |
| pos_by_sex | Integer | The place in which a runner finished by sex for a given year. | Excluded. See below for the reason. |
| total_by_sex | Integer | The total number of competitors of each sex for a given year. | Excluded. See below for the reason. |
| Sex | Character | Gender of the runner. | An important variable, since it allows us to compare the times of the two sexes rather than only viewing them together. |
| Hometown | Character | Hometown of the runner. | Excluded. See below for the reason. |
| PRCP | Numeric | Precipitation, recorded as daily rainfall in inches to one decimal place, collected by NOAA. | Does not contribute directly to our project objectives, but it is used to provide information about whether, and how much, it rained on the day of the race. |
| TMIN | Integer | Minimum daily temperature in Fahrenheit, collected by NOAA. | Does not contribute directly to our project objectives, but it is used to provide information about the minimum temperature on the day of the race. |
| TMAX | Integer | Maximum daily temperature in Fahrenheit, collected by NOAA. | Does not contribute directly to our project objectives, but it is used to provide information about the maximum temperature on the day of the race. |
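To illustrate how a Division code such as W2529 encodes sex and age range, here is a small parsing sketch. It assumes the layout seen in that one example (a sex letter followed by a two-digit lower and a two-digit upper age bound); codes for the four 20-year divisions may look different, so anything that does not match is returned as None instead of being guessed. The function name is hypothetical.

```python
import re

# Assumed layout: one sex letter (M or W), then two-digit lower and upper ages.
DIVISION_PATTERN = re.compile(r"^([MW])(\d{2})(\d{2})$")


def parse_division(code):
    """Split a division code into (sex, min_age, max_age), or None if unrecognized."""
    match = DIVISION_PATTERN.match(code)
    if match is None:
        return None
    sex, low, high = match.groups()
    return sex, int(low), int(high)


print(parse_division("W2529"))  # ('W', 25, 29)
```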

The table below describes the variables and data that we excluded from our analysis, or modified for our data analysis and visualization:

| What was excluded/modified | Reason for exclusion/modification |
| --- | --- |
| Hometown | The entries had many missing values and inconsistencies. We found a few individuals who reported their hometown differently each time they ran the race, or who reported several hometowns at once. Because no accurate conclusions can be drawn from it, and because it is not a variable we could use to fulfill our main objective, we excluded it from our analysis. |
| Distance | This column gives the length of the race, which is 10 miles. Since we already know the data come from the 10-mile race, a column that states this explicitly for every row is redundant. |
| Date | We excluded this variable because we know the race happens at a certain time each spring, and the specific dates would not affect our data question in any way. |
| pos_by_sex | This variable gives the position in which a runner finished by sex for each year. We decided not to use it in our data visualization since it was not important for meeting our goal. |
| total_by_sex | This variable gives the total number of people by sex per year. We decided not to include it in our analysis since it was not important for visualizing our main goal in exploring this data. |
| pos_by_division | This variable gives the position in which a runner finished within their assigned division for a given year. We decided not to use it in our data visualization since it was not important for drawing conclusions about our main goal. |
| total_by_division | This variable gives the total number of individuals in each division for a given year. It is excluded for the same reason as pos_by_division. |
| Pace | This variable gives each runner’s pace per mile for the race. We excluded it from our analysis because it was not reading in correctly. The pace can also be calculated directly from the Time variable by dividing it by 10 (the total miles in the race), so we removed this column from our data frame. |
| Data from the year of 1973 | After we cleaned the data, only 2 entries remained for that year. We excluded the year because two points contribute nothing toward our goal and would not be useful to graph. |
| Data from the year of 1977 | We removed the 1977 data because a large chunk of finishing times is missing from the middle of the range of race times. We have no information about what caused this gap. Keeping this year could bias our analysis, since many points are missing from a central part of the times, which would lead to inaccurate interpretations. |
| Data from the year of 2015 (not yet modified) | The 2015 race was only 9.39 miles long. Because the data were otherwise good, we decided to adjust the race times so that they correspond to a 10-mile race. We did not have time to do this for this data visualization project, but plan to do it for our final project. We plan to make this modification by dividing each 2015 runner’s finishing time by 9.39 to obtain their pace per mile, and multiplying that pace by 10 to get the time they would have run had they kept that pace for 10 miles. These will not be the runners’ actual times, since they have been modified; they are “standardized” to a 10-mile race to remove the bias of the shorter course that year. Had the race been the full 10 miles, the times might have differed slightly from our adjusted values, but we expect that difference to be minimal for this analysis. We therefore decided this was a better approach than removing that year’s data entirely. |
| Data from the year of 2019 (not yet modified) | The 2019 race was 80 yards short of 10 miles. We decided to “standardize” these times for the same reasons as for 2015. We have not done so yet for this data visualization project since we ran out of time, but we plan to for our final project. We plan to make this modification by finding each runner’s average pace per mile in that year and multiplying it by 10, giving each runner’s time for 10 miles. The same limitations apply as for 2015. Our group decided it was better to modify these data than to remove them. |
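The planned standardization for the short 2015 and 2019 courses amounts to computing a pace per mile over the actual distance and scaling it to 10 miles. The sketch below is one possible way to do that, not the project’s implementation; the function names are hypothetical, the input time of 1:00:00 is an invented example, and the 2019 distance is derived from the stated 80-yard shortfall using 1760 yards per mile.

```python
YARDS_PER_MILE = 1760


def to_seconds(hms):
    """Convert an 'h:mm:ss' string to a number of seconds."""
    hours, minutes, seconds = (int(part) for part in hms.split(":"))
    return hours * 3600 + minutes * 60 + seconds


def to_hms(seconds):
    """Convert seconds to an 'h:mm:ss' string, rounded to the nearest second."""
    total = round(seconds)
    hours, remainder = divmod(total, 3600)
    minutes, secs = divmod(remainder, 60)
    return f"{hours}:{minutes:02d}:{secs:02d}"


def standardize(hms, actual_miles, target_miles=10.0):
    """Scale a finishing time to target_miles, assuming a constant pace."""
    pace_per_mile = to_seconds(hms) / actual_miles  # seconds per mile
    return to_hms(pace_per_mile * target_miles)


MILES_2015 = 9.39
MILES_2019 = 10 - 80 / YARDS_PER_MILE  # 80 yards short of 10 miles

# Invented example time of one hour on each shortened course.
print(standardize("1:00:00", MILES_2015))
print(standardize("1:00:00", MILES_2019))
```

The same pace calculation with actual_miles set to 10 reproduces the excluded Pace column, which is why that column can be dropped without losing information.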

Loading and Cleaning Data

Summary Statistics:

Year, Age, Time, and Sex are the main variables to focus on.

Checklist for this section:

summary stats: mean, median, mode, range, sd, percentiles, distributions by sex variable, etc.

mention how many women and how many men in each year and overall

Data Frame Summary

df

Dimensions: 339214 x 7
Duplicates: 11759

| Variable | Stats / Values | Freqs (% of Valid) | Missing |
| --- | --- | --- | --- |
| Year [integer] | Mean (sd): 2006.4 (10.5); min < med < max: 1974 < 2009 < 2019; IQR (CV): 14 (0) | 45 distinct values | 0 (0.0%) |
| Age [integer] | Mean (sd): 36.6 (10.3); min < med < max: 8 < 35 < 87; IQR (CV): 14 (0.3) | 80 distinct values | 0 (0.0%) |
| Time [times] | Mean (sd): 00:00:00 (0); min < med < max: 00:00:00 < 00:00:00 < 00:00:00; IQR (CV): 0 (00:00:00) | 5649 distinct values | 0 (0.0%) |
| Sex [character] | 1. F; 2. M | 166755 (49.2%); 172459 (50.8%) | 0 (0.0%) |
| PRCP [numeric] | Mean (sd): 0.1 (0.1); min < med < max: 0 < 0 < 0.9; IQR (CV): 0 (2.4) | 17 distinct values | 0 (0.0%) |
| TMAX [integer] | Mean (sd): 63.3 (8.3); min < med < max: 44 < 64 < 84; IQR (CV): 14 (0.1) | 24 distinct values | 0 (0.0%) |
| TMIN [integer] | Mean (sd): 43.1 (5.6); min < med < max: 32 < 43 < 58; IQR (CV): 8 (0.1) | 23 distinct values | 0 (0.0%) |

Data Frame Summary

df

Group: Sex = F
Dimensions: 166755 x 7
Duplicates: 7844

| Variable | Stats / Values | Freqs (% of Valid) | Missing |
| --- | --- | --- | --- |
| Year [integer] | Mean (sd): 2009.2 (8.3); min < med < max: 1974 < 2011 < 2019; IQR (CV): 10 (0) | 45 distinct values | 0 (0.0%) |
| Age [integer] | Mean (sd): 34.6 (9.5); min < med < max: 8 < 32 < 87; IQR (CV): 13 (0.3) | 80 distinct values | 0 (0.0%) |
| Time [times] | Mean (sd): 00:00:00 (0); min < med < max: 00:00:00 < 00:00:00 < 00:00:00; IQR (CV): 0 (00:00:00) | 5175 distinct values | 0 (0.0%) |
| PRCP [numeric] | Mean (sd): 0 (0.1); min < med < max: 0 < 0 < 0.9; IQR (CV): 0 (2.6) | 17 distinct values | 0 (0.0%) |
| TMAX [integer] | Mean (sd): 64 (8); min < med < max: 44 < 66 < 84; IQR (CV): 14 (0.1) | 24 distinct values | 0 (0.0%) |
| TMIN [integer] | Mean (sd): 43 (5.2); min < med < max: 32 < 43 < 58; IQR (CV): 8 (0.1) | 23 distinct values | 0 (0.0%) |

Group: Sex = M
Dimensions: 172459 x 7
Duplicates: 3915

| Variable | Stats / Values | Freqs (% of Valid) | Missing |
| --- | --- | --- | --- |
| Year [integer] | Mean (sd): 2003.7 (11.6); min < med < max: 1974 < 2007 < 2019; IQR (CV): 18 (0) | 45 distinct values | 0 (0.0%) |
| Age [integer] | Mean (sd): 38.5 (10.7); min < med < max: 8 < 37 < 87; IQR (CV): 15 (0.3) | 80 distinct values | 0 (0.0%) |
| Time [times] | Mean (sd): 00:00:00 (0); min < med < max: 00:00:00 < 00:00:00 < 00:00:00; IQR (CV): 0 (00:00:00) | 5556 distinct values | 0 (0.0%) |
| PRCP [numeric] | Mean (sd): 0.1 (0.1); min < med < max: 0 < 0 < 0.9; IQR (CV): 0.1 (2.2) | 17 distinct values | 0 (0.0%) |
| TMAX [integer] | Mean (sd): 62.7 (8.5); min < med < max: 44 < 63 < 84; IQR (CV): 14 (0.1) | 24 distinct values | 0 (0.0%) |
| TMIN [integer] | Mean (sd): 43.2 (6); min < med < max: 32 < 43 < 58; IQR (CV): 8 (0.1) | 23 distinct values | 0 (0.0%) |

library(ggplot2)
library(ggridges)  # provides geom_density_ridges_gradient()

# Ridgeline density plot of Age by Year
plot_age_dist <- ggplot(df, aes(x = Age, y = as.factor(Year))) +
  geom_density_ridges_gradient(
    # after_stat(x) replaces the `..x..` notation deprecated in ggplot2 3.4.0
    aes(fill = after_stat(x)), scale = 3, size = 0.3
  ) +
  scale_fill_gradientn(
    colours = c("#0D0887FF", "#CC4678FF", "#F0F921FF"),
    name = "Age"
  ) +
  labs(title = "Age Distribution by Year", y = "")

plot_age_dist
## Picking joint bandwidth of 1.6

custom_ticks <- c("00:43", "01:07", "01:30", "01:56", "02:20")
# Evenly spaced break positions across the observed range of Time
tick_positions <- seq(min(df$Time), max(df$Time), length.out = length(custom_ticks))
tick_labels <- times(tick_positions)  # times() comes from the chron package

# Ridgeline density plot of Time by Year
plot_time_dist <- ggplot(df, aes(x = Time, y = as.factor(Year))) +
  geom_density_ridges_gradient(
    # after_stat(x) replaces the `..x..` notation deprecated in ggplot2 3.4.0
    aes(fill = after_stat(x)), scale = 3, size = 0.3
  ) +
  scale_fill_gradientn(
    colours = c("red", "purple", "blue"),
    name = "Time to finish",
    breaks = tick_positions,
    labels = custom_ticks
  ) +
  scale_x_continuous(labels = tick_labels, breaks = tick_positions) +
  labs(title = "Time Distribution by Year", y = "")

plot_time_dist
## Picking joint bandwidth of 0.00152

# scat_plot() takes a year and the data frame, and returns a scatterplot of
# Time against Age for that year with a linear trend line.
# (%>% and filter() come from dplyr.)

scat_plot <- function(curr_year, df) {
  # Subset the data to the given year
  sub_data <- df %>% filter(Year == curr_year)

  # Scatterplot using ggplot2
  ggplot(sub_data, aes(x = Age, y = as.numeric(Time))) +
    geom_point(col = "blue", shape = 1) +
    geom_smooth(method = "lm", se = FALSE, col = "red") +
    labs(title = as.character(curr_year), x = "Age (years)", y = "Time (hh:mm)") +
    scale_y_continuous(labels = custom_ticks, breaks = tick_positions)
}

# Note: par(mfrow = ...) applies only to base graphics and has no effect on
# ggplot2 output, so it is omitted and each plot below is printed on its own.

# Loop over each year present in df and call scat_plot().
# unique(df$Year) skips excluded years such as 1977 and will still work if we
# remove other years.
for (curr_year in unique(df$Year)) {
  print(scat_plot(curr_year, df))
}
## `geom_smooth()` using formula = 'y ~ x'

(This message is repeated 45 times, once for each yearly scatterplot.)

df %>%
  ggplot() +
  geom_point(aes(x = Age, y = Time)) + 
  facet_wrap(~Sex) 
## Don't know how to automatically pick scale for object of type <times>.
## Defaulting to continuous.